
Conversation

@aditya0by0 (Member) commented Oct 8, 2025

This PR addresses a memory accumulation issue observed during training with the ResGatedDynamicGNI model when persistent_workers=True is enabled in the DataLoader.

⚙️ Existing Problem

  • The ResGatedDynamicGNI model performs per-forward random feature initialization for both node and edge features (new_x, new_edge_attr) on the GPU (see the sketch after this list).

  • When combined with persistent DataLoader workers, these per-batch random allocations are not released properly because:

    • Worker processes remain alive across epochs.
    • CUDA’s caching allocator retains fragmented memory blocks.
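
For context, here is a minimal sketch of the per-forward initialization pattern described above. The class name and tensor shapes are illustrative assumptions, not the project's actual implementation; the point is only the behaviour: fresh random CUDA allocations on every forward call, never reused across batches or epochs.

```python
import torch
import torch.nn as nn


class DynamicNoiseInit(nn.Module):
    # Illustrative sketch only (hypothetical class, not the project's code):
    # fresh random node/edge features are drawn on the model's device in every
    # forward pass, so each batch produces new allocations that are never reused.
    def __init__(self, node_dim: int, edge_dim: int):
        super().__init__()
        self.node_dim = node_dim
        self.edge_dim = edge_dim

    def forward(self, x: torch.Tensor, edge_attr: torch.Tensor):
        # Random features replace the incoming ones on every call.
        new_x = torch.randn(x.size(0), self.node_dim, device=x.device)
        new_edge_attr = torch.randn(
            edge_attr.size(0), self.edge_dim, device=edge_attr.device
        )
        return new_x, new_edge_attr
```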

While setting persistent_workers=True can improve performance when input features remain constant throughout training (as noted in the Lightning documentation), it becomes problematic when input features are dynamically initialized in each forward pass. In such cases, the DataLoader workers retain these transient tensors in memory, expecting reuse across epochs. Since they are never reused, this leads to progressive GPU memory accumulation and can eventually cause out-of-memory (OOM) errors. See the related issue and logs here.

Refer: https://lightning.ai/docs/pytorch/stable/advanced/speed.html#persistent-workers

🧠 Root Cause

persistent_workers=True keeps worker subprocesses alive between epochs, so the CUDA contexts and cached allocations behind the tensors that the ResGatedDynamicGNI model reinitializes in each forward pass are retained rather than released.

🔧 Fix Implemented

  • Exposed persistent_workers as a configurable DataLoader option via the CLI, so it can be set to False for ResGatedDynamicGNI model training (see the sketch after this list).
    This ensures that:

    • Workers are restarted cleanly each epoch.
    • GPU and CPU memory are fully released after each epoch.
    • Memory fragmentation and accumulation are avoided.
  • The default remains True, as before, so no existing pipelines are disrupted.
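
Below is a minimal sketch of the kind of DataLoader wiring this option controls. The helper name build_dataloader and its signature are assumptions for illustration (the real option is plumbed through the project's training CLI); the persistent_workers semantics are those of torch.utils.data.DataLoader.

```python
from torch.utils.data import DataLoader, Dataset


def build_dataloader(dataset: Dataset, batch_size: int, num_workers: int,
                     persistent_workers: bool = True) -> DataLoader:
    # Hypothetical helper: the default stays True to preserve the previous
    # behaviour; pass False (via the CLI option) for models such as
    # ResGatedDynamicGNI that re-initialize their input features on every
    # forward pass.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        # DataLoader rejects persistent_workers=True when num_workers == 0,
        # so only enable it when worker processes actually exist.
        persistent_workers=persistent_workers and num_workers > 0,
    )
```

With persistent_workers=False, worker processes are shut down at the end of each epoch and recreated for the next one, trading a small per-epoch startup cost for a clean release of worker-held memory.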

@aditya0by0 aditya0by0 requested a review from sfluegel05 October 8, 2025 16:14
@aditya0by0 aditya0by0 self-assigned this Oct 8, 2025
@sfluegel05 (Collaborator) commented:

I don't see how the run you linked refers to any out-of-memory issues. The GPU memory allocation is at a constant 7.2% for the whole run.

Aside from that, having this option won't hurt, so I am merging this.

@sfluegel05 sfluegel05 merged commit d52b422 into dev Oct 14, 2025
5 checks passed
@sfluegel05 sfluegel05 deleted the fix/persistent_workers branch October 14, 2025 10:56
@aditya0by0 (Member, Author) commented Oct 14, 2025
